We generated a FlowSOM map using all donors and recipients of the St Louis cohort:
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/cyto/3_backbones/backbone_2_D&Rall/fsom_rd.RData")
PlotStars(UpdateNodeSize(fsom_rd$FlowSOM, maxNodeSize = 8, reset = TRUE),
markers = names(prettyMarkerNames)[which(prettyMarkerNames%in% c("CD11a","CD16","CD127","CD3","CD4","CD45RA","CD8a","HLADR","CD19",
"CD38","CD161","CCR7","CD27","CCR4","CCR5","CD5","CXCR3","Fas",
"foxP3","CD24","CXCR5"))],
view = "MST")
FlowSOM’s clusters were grouped into 40 metaclusters, using FlowSOM’s function:
These metaclusters were manually annotated by Laetitia:
I extract informations at the couple level, by comparing donors and recipients from the same couple, to extract two new features: - gender compatibility (donor gender -> recipient gender) - age of the recipient when he/ she recieved the graft
Here’s a sample of the resulting table:
## [1] 1
## [1] 1
## [1] 1
## [1] 2
## [1] 22524
## COUPLENUMBER GENDER DATEOFSAMPLE DOB GROUP
## R2498 1 F 2016-08-05 1954-12-05 non_tolerant
## D2497 1 F 2014-07-09 1962-07-06 non_tolerant
## R1503 3 M 2016-02-12 1991-05-19 primary_tolerant
## D1502 3 F 2013-12-13 1990-04-03 primary_tolerant
## R2032 6 M 2016-05-13 1971-04-27 secondary_tolerant
## D2031 6 M 2014-04-04 1958-03-24 secondary_tolerant
## CMVStatus gender_comp cmv_comp age_recip
## R2498 1 FF 01 22524
## D2497 0 FF 01 22524
## R1503 1 FM 11 9035
## D1502 1 FM 11 9035
## R2032 0 MM 00 16453
## D2031 0 MM 00 16453
We then visualise the donors and recipients on a same PCA, which is built on the patient’s cell profiles as identified in the FlowSOM map above. We measure the distances between donors and recipients from the same couple on the PCA. It looks like the distance on the PCA between non-tolerant donors and recipients is higher than between oher tolerance groups.
The variables that are the most represented in the two first PCA components are the following:
## PC1
## 1_Conventional DCs (100%) -0.13670715
## 2_CD8 TEM (47%) / TEMRA (42%) / TCM (4%) / TSCM (3%) -0.73094803
## 6_CD8 T cells FoxP3+ CXCR3+ CCR7+ FAS+ -0.14383930
## 14_CD4 Th1-Th17 TEM-TEMRA like (90%) / DP Th1-Th17 TEMRA like (10%) -0.10652909
## 26_CD4 naives (90%) / CD4 TCM (10%) 0.60808985
## 27_CD4 Treg 0.07776088
## 33_B naives (100%) 0.09102204
## PC2
## 1_Conventional DCs (100%) 0.80831893
## 2_CD8 TEM (47%) / TEMRA (42%) / TCM (4%) / TSCM (3%) -0.38647190
## 6_CD8 T cells FoxP3+ CXCR3+ CCR7+ FAS+ -0.09248087
## 14_CD4 Th1-Th17 TEM-TEMRA like (90%) / DP Th1-Th17 TEMRA like (10%) -0.05289846
## 26_CD4 naives (90%) / CD4 TCM (10%) -0.32062872
## 27_CD4 Treg -0.06626228
## 33_B naives (100%) 0.16217140
We can try to build a model to classify the couples between primary, secondary and non tolerant.
In order to have one variable per couple for every metacluster, we substract the percentage of cells of the recipient to the percentage of cells of the donor from a same couple. The resulting variables show how different a donor and recipient from the same couple are.
We can try to classify couples into the 3 tolerance groups (primary, secondary and non-tolerant) based on how different the donor and recipient profiles are:
##
## Call:
## randomForest(formula = group ~ ., data = pctgs_meta_couple)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 6
##
## OOB estimate of error rate: 38.24%
## Confusion matrix:
## non_tolerant primary_tolerant secondary_tolerant
## non_tolerant 19 2 0
## primary_tolerant 5 1 1
## secondary_tolerant 3 2 1
## class.error
## non_tolerant 0.0952381
## primary_tolerant 0.8571429
## secondary_tolerant 0.8333333
Since classifying into three classes doesn’t seem feasible, we try to classify couples into tolerant and non tolerant::
##
## Call:
## randomForest(formula = group ~ ., data = df_tmp, mtry = mtry, ntree = ntree, importance = T, proximity = T)
## Type of random forest: classification
## Number of trees: 10000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 32.35%
## Confusion matrix:
## non_tolerant tolerant class.error
## non_tolerant 17 4 0.1904762
## tolerant 7 6 0.5384615
The classification results aren’t great, almost all patients are classified as “non_tolerant”
We can also try to classify between tolerant patients only (primary versus secondary):
##
## Call:
## randomForest(formula = group ~ ., data = df_tol, mtry = mtry, ntree = ntree, importance = T, proximity = T)
## Type of random forest: classification
## Number of trees: 10000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 61.54%
## Confusion matrix:
## primary_tolerant secondary_tolerant class.error
## primary_tolerant 4 3 0.4285714
## secondary_tolerant 5 1 0.8333333
But it seems like, based on the percentages in FlowSOM’s metaclusters, we’re not able to make a distinction between primary and secondary tolerant couples.
We can also look at the percentage of cells expressing specific functional markers in FlowSOM’s metaclusters, based on the FlowJo workspaces of Laetitia:
Now I have one matrix per patient that contains the % of positive cells for each marker in each metacluster, I need to reshape into one matrix per marker, with patients in rows and % pos per metacluster in columns.
## /41BB+.1 /41BB+.2 /41BB+.3 /41BB+.4 /41BB+.5
## D2031 0.02981294 0.012849341 0.000000000 0.01442308 0.01634675
## D1073 0.09837363 0.053754081 0.045454545 0.04541305 0.07909091
## D1502 0.07960824 0.056295893 0.005050505 0.02677735 0.06823556
## D2497 0.02777778 0.009722222 0.015151515 0.01017294 0.01254647
## R1044 0.05871846 0.029285581 0.017988253 0.03950860 0.16666667
## R1503 0.01593081 0.003912083 0.003418803 0.01249869 0.02120777
Again, we can build a PCA to see how fat the donors and recipients from a same couple are, but based only on the functional markers this time.
The data seem to separate into two main groups in the PCA. I try to identify the cause of this separation. All samples from before September 2017 are on the right of the PCA, all the others are on the left. What changed in this period? - a marker? - the gates of positivity?
We can see which features are pushing the patients into the two groups. The group that’s pushed at the bottom right mainly seems to be caused by the expression of 41BB, OX40, Tim3 and PD1 in the cluster 32:
The group that’s pushed at the top left mainly seems to be caused by the expression of CD38, CD24 and ICOS in the cluster 32:
Since the expression of markers in the cluster 32 seems to draw all the variability, we’ll rescale the features: This seems to remove the two groups.
We can see which features are pushing these outliers away. Mainly the expression of phenotypic markers in the cluster 37 (CD4 TSCM).
Now the PCA has been rescaled, we can again plot it for couples:
We can merge the information about the percentages in the FlowSOM metaclusters and the percentage of positive cells for the functional markers in these metaclusters:
And we run a random forest model on phenotypic and functional markers : But it seems like the information contained in the CyTOF data isn’t sufficient to classify between Primary, secondary and non tolerant couples.
pheno_funct <- cbind(pctgs_meta_couple[,-ncol(pctgs_meta_couple)], df_f_couple)
rf_pheno_funct <- randomForest::randomForest(group~., pheno_funct,
mtry = 50, ntree = 10000)
rf_pheno_funct
##
## Call:
## randomForest(formula = group ~ ., data = pheno_funct, mtry = 50, ntree = 10000)
## Type of random forest: classification
## Number of trees: 10000
## No. of variables tried at each split: 50
##
## OOB estimate of error rate: 41.18%
## Confusion matrix:
## non_tolerant primary_tolerant secondary_tolerant
## non_tolerant 20 1 0
## primary_tolerant 7 0 0
## secondary_tolerant 4 2 0
## class.error
## non_tolerant 0.04761905
## primary_tolerant 1.00000000
## secondary_tolerant 1.00000000
Trying to classify between non tolerant and tolerant couples leads to slightly better results
rf <- rf_tol_non_tol(df_orig = pheno_funct, mtry = 70, ntree = 10000)
rf
##
## Call:
## randomForest(formula = group ~ ., data = df_tmp, mtry = mtry, ntree = ntree, importance = T, proximity = T)
## Type of random forest: classification
## Number of trees: 10000
## No. of variables tried at each split: 70
##
## OOB estimate of error rate: 20.59%
## Confusion matrix:
## non_tolerant tolerant class.error
## non_tolerant 18 3 0.1428571
## tolerant 4 9 0.3076923
But the data doesn’t allow us to classify between tolerant patients only (primary versus secondary):
rf_tol <- rf_tol1_tol2(df_orig = pheno_funct, mtry = 70, ntree = 10000)
rf_tol
##
## Call:
## randomForest(formula = group ~ ., data = df_tol, mtry = mtry, ntree = ntree, importance = T, proximity = T)
## Type of random forest: classification
## Number of trees: 10000
## No. of variables tried at each split: 70
##
## OOB estimate of error rate: 76.92%
## Confusion matrix:
## primary_tolerant secondary_tolerant class.error
## primary_tolerant 2 5 0.7142857
## secondary_tolerant 5 1 0.8333333
Run the permutation distribution, taking demographic variables into account: I first add the age and gender compatibility information to the data.
I then compute the permutation distribution for non vs tolerant couples. Each variable (one of the fourty populations found + percentage of cells that express a certain functional marker in these fourty populations) is tested, to see if it can be used to separate tolerant from no tolerant patients. We also try including the recipients age and the gender compatibility in the model, to see if some variables are still informative even after “correcting” for these “confounding factors”.
Compute the median per group for every gene, to use it later in the graphs:
# The permutation values were also computed for the feature "group", per error
# -> I remove the last perm value (number 561)
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/cyto/rd/perm_vals_TNT.RData")
perm_vals <- perm_vals[-length(perm_vals)]
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/cyto/rd/norm_data_TNT.RData")
gr_medians <- norm_data %>%
group_by(group) %>%
summarize_if(is.numeric, median) %>%
as.data.frame()
rownames(gr_medians) <- gr_medians$group
Selected features for the couples (with a threshold of 0.90), after computing permutation distributions for every feature:
## Note: zip::zip() is deprecated, please use zip::zipr() instead
## [1] "147 features were selected with a 0.90 threshold in the St Louis cohort."
Which of these features were also found in the Cryostem cohort?
## [1] 27
## [1] "X30_CD4.TCM.Th2.Like...Treg..20.50.."
## [2] "X.CD24..2_CD8.TEM..47.....TEMRA..42.....TCM..4.....TSCM..3.."
## [3] "X.CD24..14_CD4.Th1.Th17.TEM.TEMRA.like..90.....DP.Th1.Th17.TEMRA.like..10.."
## [4] "X.CD24..38_B.transitional..64..."
## [5] "X.CD38..1_Conventional.DCs..100.."
## [6] "X.CD38..5_NK.cells..98.....T.cells.DN.TEMRA..2.."
## [7] "X.CD38..6_CD8.T.cells.FoxP3..CXCR3..CCR7..FAS."
## [8] "X.CD38..7_CD8.TSCM..67...CD8.TCM..33.."
## [9] "X.CD38..10_T.population.CD8...DP.CD4..FoxP3...40.70..."
## [10] "X.CD38..11_DN.T.cells...CD8.low.TEMRA"
## [11] "X.CD38..13_CD4.TCM...TEM.Th2.like.FoxP3...6.30.."
## [12] "X.CD38..14_CD4.Th1.Th17.TEM.TEMRA.like..90.....DP.Th1.Th17.TEMRA.like..10.."
## [13] "X.CD38..16_T.population.Naive.TSCM"
## [14] "X.CD38..17_CD8...87...DP..10....TEMRA..70....TSCM..30....CXCR5..B.markers"
## [15] "X.CD38..19_DN..86...CD8low.13...TSCM.like.FoxP3...30.à.90.."
## [16] "X.CD38..20_NK.cells..66.....Conventional.DCs...some.moDCs..34.."
## [17] "X.CD38..23_DN..86.....CD8low..13...TSCM.like"
## [18] "X.CD38..24_CD4.TCM.Th17.like"
## [19] "X.CD38..25_B.naives..90.....B.memory..10.."
## [20] "X.CD38..27_CD4.Treg"
## [21] "X.CD38..28_B.naives..91.....B.memory..8.....B.transitional..2.."
## [22] "X.CD38..31_CD4.Treg"
## [23] "X.CD38..32_CD4.Treg...B.markers"
## [24] "X.CD38..33_B.naives..100.."
## [25] "X.CD38..37_CD4.TSCM..50.....CD4.TEMRA..50.....B.markers"
## [26] "X.ICOS..29_DP.Treg"
## [27] "X.PD1..11_DN.T.cells...CD8.low.TEMRA"
Save the selected features for the St Louis couples:
pctgs_sel_ft_couples_90 <- norm_data[,c("group", "COUPLENUMBER", selected_ft_rd_qt[sel_common])]
save(pctgs_sel_ft_couples_90,
file = "../data/cyto/rd/pctgs_sel_ft_couples_TNT_90.RData")
Graph of the selected metabolites that are common between the two cohorts:
## IGRAPH 4067cdb UN-- 27 24 --
## + attr: name (v/c), value (e/n)
## + edges from 4067cdb (vertex names):
## [1] X.CD38..1_Conventional.DCs..100.. --X.CD38..5_NK.cells..98.....T.cells.DN.TEMRA..2..
## [2] X.CD38..5_NK.cells..98.....T.cells.DN.TEMRA..2..--X.CD38..20_NK.cells..66.....Conventional.DCs...some.moDCs..34..
## [3] X.CD38..1_Conventional.DCs..100.. --X.CD38..6_CD8.T.cells.FoxP3..CXCR3..CCR7..FAS.
## [4] X.CD38..5_NK.cells..98.....T.cells.DN.TEMRA..2..--X.CD38..6_CD8.T.cells.FoxP3..CXCR3..CCR7..FAS.
## + ... omitted several edges
We can also see how many of these features had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients)
We can try to build new models to classify tolerant from non tolerant couples based on the features that were selected and common between the two cohorts: Random forest on the selected features at the couple level:
colnames(norm_data) <- make.names(colnames(norm_data))
set.seed(1)
rf1 <- rf_tol_non_tol(df_orig = norm_data[,c("group", make.names(selected_ft_rd_qt[sel_common]))], mtry = 10, ntree = 1000, package_2use = "ranger")
rf1$confusion.matrix
## predicted
## true non_tolerant tolerant
## non_tolerant 17 4
## tolerant 5 8
paste0("prediction error: ", rf1$prediction.error*100)
## [1] "prediction error: 26.4705882352941"
Now that we have selected features that seemed to play a role in tolerance that were common to both cohorts (with a threhold over 0.90), we decide to investigate into these features to see if some of them are related to the age of the recipients, to the gender compatibility between the donor and recipient, to both or to none of these two causes.
I first build new models using the features that were kept in the two cohorts as informative when comparing tolerant and non tolerant patients:
new_models <- lapply(seq_along(make.names(selected_ft_rd_qt[sel_common])), function(idx){
dta_tmp <- norm_data[,c(make.names(selected_ft_rd_qt[sel_common])[idx],
"group", "age_recip", "gender_comp")]
colnames(dta_tmp)[1] <- "Var"
fit1 <- glm(group ~ Var, data=dta_tmp,trace=F, family = "binomial")
fit2 <- glm(group ~ gender_comp + Var,
data=dta_tmp,trace=F, family = "binomial")
fit3 <- glm(group ~ age_recip + Var,
data=dta_tmp,trace=F, family = "binomial")
fit4 <- glm(group ~ gender_comp + age_recip + Var,
data=dta_tmp,trace=F, family = "binomial")
list(fit1, fit2, fit3, fit4)
})
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
plots_gender_age(pdf_name = "../plots/cyto/r&d/age_and_gender/tol_nontol_0.9.pdf",
norm_data = norm_data,
features = make.names(selected_ft_rd_qt[sel_common]))
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Transformation introduced infinite values in continuous x-axis
## quartz_off_screen
## 2
Run the permutation distribution, taking demographic variables into account: I first add the age and gender compatibility information to the data.
I then compute the permutation distribution to identify variables that could be used to separate primary from secondary tolerant couples.
Compute the median per group for every gene, to use it later in the graphs:
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/cyto/rd/perm_vals_PTST.RData")
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/cyto/rd/norm_data_PTST.RData")
gr_medians <- norm_data %>%
group_by(group) %>%
summarize_if(is.numeric, median) %>%
as.data.frame()
rownames(gr_medians) <- gr_medians$group
Selected features for the couples (with a threshold of 0.90), after computing permutation distributions for every metabolite:
qt <- unlist(map(perm_vals, 1))
# barplot(qt)
densityplot(qt)
selected_ft_rd_qt <- colnames(norm_data)[which(qt>.90)]
selected_ft_rd_qt <- selected_ft_rd_qt[!is.na(selected_ft_rd_qt)]
# selected_ft_rd_qt
write.xlsx(selected_ft_rd_qt,
file = "../data/cyto/rd/perm_val_PTST_90_thresh.xlsx")
print(paste0(length(selected_ft_rd_qt),
" features were selected with a 0.90 threshold in the St Louis cohort."))
## [1] "57 features were selected with a 0.90 threshold in the St Louis cohort."
Which of these features are common to the Cryostem cohort?
sel_louis <- read.xlsx("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/cyto/rd/perm_val_PTST_90_thresh.xlsx", colNames = FALSE)
sel_cryo <- read.xlsx("../data/cyto_national/test_stlouis_backbone/rd/perm_val_PTST_90_thresh.xlsx", colNames = FALSE)
sel_common <- which(sel_louis$X1 %in% sel_cryo$X1)
length(sel_common)
## [1] 11
comm <- sel_louis$X1[sel_common]
comm
## [1] "X.CD24..35_Conventional.DCs...90.."
## [2] "X.HLADR..39_B.naive..50...B.memory..50...or.unspecified.B.cells"
## [3] "X.ICOS..9_MoDCs..99.....CD4.TEM..1.."
## [4] "X.IL10..8_DP.Treg.TSCM.like."
## [5] "X.IL10..10_T.population.CD8...DP.CD4..FoxP3...40.70..."
## [6] "X.IL10..16_T.population.Naive.TSCM"
## [7] "X.IL10..31_CD4.Treg"
## [8] "X.IL10..35_Conventional.DCs...90.."
## [9] "X.OX40..5_NK.cells..98.....T.cells.DN.TEMRA..2.."
## [10] "X.OX40..10_T.population.CD8...DP.CD4..FoxP3...40.70..."
## [11] "X.OX40..36_CD4.TSCM"
Save the selected features for the St Louis couples:
norm_data_tmp <- norm_data %>%
rownames_to_column("COUPLENUMBER")
pctgs_sel_ft_couples_90 <- norm_data_tmp[,c("group", "COUPLENUMBER", comm)]
save(pctgs_sel_ft_couples_90,
file = "../data/cyto/rd/pctgs_sel_ft_couples_PTST_90.RData")
Graph of these selected features: The selected features that differed mostly between secondary tolerant donors and recipients from the same couple are colored in blue, the ones that differed most between primary tolerant donors and colored recipients are in green.
We can also see how many of these features had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients)
Now that we have selected features that should help us classify primary and secondary tolerant couples, we can see that a RF model built on these selected features is better than the one built on all features:
mat_rf <- pctgs_sel_ft_couples_90 %>%
column_to_rownames("COUPLENUMBER")
set.seed(1)
rf <- ranger::ranger(group~., data = mat_rf, num.trees = 5000, importance = "impurity")
rf$confusion.matrix
## predicted
## true primary_tolerant secondary_tolerant
## primary_tolerant 6 1
## secondary_tolerant 1 5
paste0("prediction error: ", rf$prediction.error*100)
## [1] "prediction error: 15.3846153846154"
Now that we have selected features that seemed to play a role in tolerance that were common to both cohorts (with a threhold over 0.90), we decide to investigate into these features to see if some of them are related to the age of the recipients, to the gender compatibility between the donor and recipient, to both or to none of these two causes.
I first build new models using the features that were kept in the two cohorts as informative when comparing tolerant and non tolerant patients:
new_models <- lapply(seq_along(make.names(comm)), function(idx){
dta_tmp <- norm_data[,c(make.names(comm)[idx],
"group", "age_recip", "gender_comp")]
colnames(dta_tmp)[1] <- "Var"
fit1 <- glm(group ~ Var, data=dta_tmp,trace=F, family = "binomial")
fit2 <- glm(group ~ gender_comp + Var,
data=dta_tmp,trace=F, family = "binomial")
fit3 <- glm(group ~ age_recip + Var,
data=dta_tmp,trace=F, family = "binomial")
fit4 <- glm(group ~ gender_comp + age_recip + Var,
data=dta_tmp,trace=F, family = "binomial")
list(fit1, fit2, fit3, fit4)
})
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
plots_gender_age(pdf_name = "../plots/cyto/r&d/age_and_gender/prim_sec_0.9.pdf",
norm_data = norm_data,
features = comm)
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Transformation introduced infinite values in continuous x-axis
## quartz_off_screen
## 2
I load the fcs files and select only the files corresponding to the recipients:
Isolate only the percentage profiles of the recipients, ie for every recipient, the percentage of cells that mapped to the 40 metaclusters annotated by Laetitia.
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/cyto/3_backbones/backbone_2_D&Rall/pctgs_meta_rd_automatic_clustering.RData")
pctgs_recip <- pctgs_meta_rd[names(recip_names),]
colnames(pctgs_recip) <- labels
save(pctgs_recip, file = "../data/cyto/3_backbones/backbone_2_D&Rall/pctgs_meta_recip.RData")
We can visualise the patients FlowSOM profiles (the percentage of cells of the patients that map to the different metaclusters) in a PCA or tSNE map, to see which patients are most similar:
PCA :
We observe a group of mainly tolerant recipients on the right (high PC1 values). We also observe the group of high CMV non tolerant patients on the left.
We can also color the PCA by CMV compatibility:
tSNE : again, the group of patients that was slightly isolated on the left in the PCA is also isolated in the tSNE plot:
## Warning: Column `Id.Cryostem.R` joining character vector and factor,
## coercing into character vector
What is special about these “low PC1” recipients? We can see on the following heatmap that they are CMV positive. They seem to have way more CD8 TEM/ TEMRA cells, slightly more CD8 Treg cells, and less Conventional DCs and B naive cells.
## Joining, by = "Id.Cryostem.R"
## Warning: Column `Id.Cryostem.R` joining character vector and factor,
## coercing into character vector
I tag these patients to visualise them later on:
high_PC1 <- rep(0, nrow(samp_recip))
high_PC1[which(rownames(samp_recip) %in% c("R598", "R836", "03R", "R219", "R690",
"R830", "R2798"))] <- 1
samp_recip <- samp_recip %>%
rownames_to_column("rownames") %>%
mutate(high_PC1 = high_PC1) %>%
column_to_rownames("rownames")
save(samp_recip, file = "../data/cyto/3_backbones/backbone_2_D&Rall/samp_recip.RData")
Here are boxplots of the metaclusters that significantly differ between the “normal” non-tolerant recipients and the “high_PC1” non tolerant recipients: The “normal” non-tolerant recipients are represented in red, the “high_PC1” non- tolerant recipients are represented in light blue.
I extract the ratios of cells that are positive for the functional markers, based on the thresholds that Laetitia has provided:
We can also see how the high_PC1 patients functional markers expression differ from the other non_tolerant patients: The “normal” non-tolerant recipients are represented in red, the “high_PC1” non- tolerant recipients are represented in light blue.
We can merge information about the percentages in the FlowSOM metaclusters and the percentage of positive cells for the functional markers in these metaclusters:
## Joining, by = "Id.Cryostem.R"
## Warning: Column `Id.Cryostem.R` joining character vector and factor,
## coercing into character vector
In the PCA, the first principal component seem to push tolerant recipients more to the right, and non tolerant patients more to the lft, with two outliers: 03R and R836.
Again, we can see which variables are driving the PCs:
## PC1_features
## /CTLA4+_34_CD4 TCM Th2 like -0.09121716
## /PD1+_37_CD4 TSCM (50%) / CD4 TEMRA (50%) + B markers -0.08784124
## /OX40+_21_B naives -0.08718681
## /ICOS+_37_CD4 TSCM (50%) / CD4 TEMRA (50%) + B markers -0.08686915
## /41BB+_37_CD4 TSCM (50%) / CD4 TEMRA (50%) + B markers -0.08432972
## /LagT+_40_CD4 TSCM like + B markers -0.08430312
## /Tim3+_37_CD4 TSCM (50%) / CD4 TEMRA (50%) + B markers -0.08385365
## /41BB+_40_CD4 TSCM like + B markers -0.08339801
## /ICOS+_24_CD4 TCM Th17 like -0.08269454
## /CTLA4+_24_CD4 TCM Th17 like -0.08264044
## PC2_features
## /Tim3+_22_B memory -0.09753983
## /Tim3+_33_B naives (100%) -0.09433103
## /Tim3+_2_CD8 TEM (47%) / TEMRA (42%) / TCM (4%) / TSCM (3%) -0.09196915
## /41BB+_21_B naives -0.09093288
## /ICOS+_39_B naive (50%)/B memory (50%) or unspecified B cells -0.09043190
## /Tim3+_14_CD4 Th1-Th17 TEM-TEMRA like (90%) / DP Th1-Th17 TEMRA like (10%) -0.09020090
## /41BB+_33_B naives (100%) -0.09014497
## /HLADR+_39_B naive (50%)/B memory (50%) or unspecified B cells 0.08996948
## /PD1+_33_B naives (100%) -0.08975123
## /ICOS+_28_B naives (91%) / B memory (8%) / B transitional (2%) -0.08970743
We can try to build a model to see if we would be able to classify non_tolerant, primary and secondary tolerant recipients based on the joint information on the phenotypic and functional markers : We can see that all most non tolerant patients are correctly classified, but 5 primary tolerant recipients are wrongly classified as non or secondary tolerant, and all secondary tolerant patients are wrongly classified.
## Joining, by = "Id.Cryostem.R"
## Warning: Column `Id.Cryostem.R` joining character vector and factor,
## coercing into character vector
##
## Call:
## randomForest(formula = group ~ ., data = pheno_funct, mtry = 50, ntree = 10000)
## Type of random forest: classification
## Number of trees: 10000
## No. of variables tried at each split: 50
##
## OOB estimate of error rate: 35.29%
## Confusion matrix:
## non_tolerant primary_tolerant secondary_tolerant
## non_tolerant 20 1 0
## primary_tolerant 3 2 2
## secondary_tolerant 5 1 0
## class.error
## non_tolerant 0.04761905
## primary_tolerant 0.71428571
## secondary_tolerant 1.00000000
We can then try to make the classification problem easier, and try to classify only non tolerant from tolerant recipients: In this case, the classification is more accurate: 17 of the 21 non tolerant recipients and 10 out of the 13 tolerant recipients are correctly classified.
set.seed(1)
rf2_pf <- rf_tol_non_tol(df_orig = pheno_funct, mtry = 70, ntree = 10000)
rf2_pf
##
## Call:
## randomForest(formula = group ~ ., data = df_tmp, mtry = mtry, ntree = ntree, importance = T, proximity = T)
## Type of random forest: classification
## Number of trees: 10000
## No. of variables tried at each split: 70
##
## OOB estimate of error rate: 20.59%
## Confusion matrix:
## non_tolerant tolerant class.error
## non_tolerant 17 4 0.1904762
## tolerant 3 10 0.2307692
We can see which are the main features that help to classify the recipients between the tolerant and non tolerant group:
## non_tolerant tolerant
## X23_DN..86.....CD8low..13...TSCM.like 0.01 0.03
## X26_CD4.naives..90.....CD4.TCM..10.. 0.02 0.02
## X.CD38._5_NK.cells..98.....T.cells.DN.TEMRA..2.. 0.02 0.03
## MeanDecreaseAccuracy
## X23_DN..86.....CD8low..13...TSCM.like 0.02
## X26_CD4.naives..90.....CD4.TCM..10.. 0.02
## X.CD38._5_NK.cells..98.....T.cells.DN.TEMRA..2.. 0.02
## MeanDecreaseGini
## X23_DN..86.....CD8low..13...TSCM.like 1.06
## X26_CD4.naives..90.....CD4.TCM..10.. 0.85
## X.CD38._5_NK.cells..98.....T.cells.DN.TEMRA..2.. 1.12
It seems like we cannot really classify primary from secondary tolerant recipients based on the phenotypic and functional CyTOF markers:
set.seed(1)
rftol_f <- rf_tol1_tol2(df_orig = pheno_funct, mtry = 50, ntree = 10000)
rftol_f
##
## Call:
## randomForest(formula = group ~ ., data = df_tol, mtry = mtry, ntree = ntree, importance = T, proximity = T)
## Type of random forest: classification
## Number of trees: 10000
## No. of variables tried at each split: 50
##
## OOB estimate of error rate: 38.46%
## Confusion matrix:
## primary_tolerant secondary_tolerant class.error
## primary_tolerant 6 1 0.1428571
## secondary_tolerant 4 2 0.6666667
Run the permutation distribution, taking demographic variables into account: I first add the age and gender compatibility information to the data.
It looks like the distribution of the CyTOF features among the patients is quite right skewed, so I try if log2 transforming it would yield to better results:
# replace zeros by half of the smallest value, and then
# log2 transform all CyTOF features:
norm_data_orig <- norm_data
nd <- norm_data %>% select(-group, -age_recip, -gender_comp)
norm_data <- apply(nd, 2, function(x){
if(all(x>=0)==FALSE){
x[x<0] <- 0
}
smallest <- min(x[x>0])
x[x==0] <- smallest/2
log2(x)
}) %>% as.data.frame %>%
rownames_to_column("rn") %>%
mutate(group = norm_data_orig$group,
age_recip = norm_data_orig$age_recip,
gender_comp = norm_data_orig$gender_comp) %>%
column_to_rownames("rn")
I then compute the permutation distribution for non vs tolerant recipients.
Compute the median per group for every gene, to use it later in the graphs:
perm_vals <- readRDS(paste0(recip_path,"perm_vals_TNT.RDS"))
norm_data <- readRDS(paste0(recip_path, "norm_data_TNT.RDS"))
gr_medians <- norm_data %>%
group_by(group) %>%
summarize_if(is.numeric, median) %>%
as.data.frame()
rownames(gr_medians) <- gr_medians$group
Selected features for the recipients (with a threshold of 0.95), after computing permutation distributions for every feature:
selected_ft_recip_qt <- select_features(perm_vals = perm_vals,
norm_data = norm_data,
threshold = 0.95,
file_path = recip_path,
file_name = "perm_val_TNT_95_thresh.xlsx")
print(paste0(length(selected_ft_recip_qt),
" features were selected with a 0.95 threshold in the St Louis cohort."))
## [1] "137 features were selected with a 0.95 threshold in the St Louis cohort."
We can then see which of these features are common to the Cryostem cohort:
## [1] "12 features were commonly found in both cohorts"
We can see how many of these features had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients)
## [1] "12 features out of the 12 common selected features have the same behaviour in both cohorts."
## [1] "X23_DN..86.....CD8low..13...TSCM.like"
## [2] "X.CD24._30_CD4.TCM.Th2.Like...Treg..20.50.."
## [3] "X.CD38._6_CD8.T.cells.FoxP3..CXCR3..CCR7..FAS."
## [4] "X.CD38._7_CD8.TSCM..67...CD8.TCM..33.."
## [5] "X.CD38._10_T.population.CD8...DP.CD4..FoxP3...40.70..."
## [6] "X.CD38._19_DN..86...CD8low.13...TSCM.like.FoxP3...30.à.90.."
## [7] "X.CD38._23_DN..86.....CD8low..13...TSCM.like"
## [8] "X.CD38._33_B.naives..100.."
## [9] "X.CTLA4._2_CD8.TEM..47.....TEMRA..42.....TCM..4.....TSCM..3.."
## [10] "X.CTLA4._10_T.population.CD8...DP.CD4..FoxP3...40.70..."
## [11] "X.CTLA4._20_NK.cells..66.....Conventional.DCs...some.moDCs..34.."
## [12] "X.CTLA4._24_CD4.TCM.Th17.like"
Now that we have selected features that should help us classify non and tolerant recipients, we can see that a RF model built on these selected features gives similar results to the one built on all features:
dta_tmp <- norm_data[,c("group", common_features$common_features)]
set.seed(1)
rf <- ranger::ranger(group~., data = dta_tmp, num.trees = 5000, importance = "impurity")
rf$confusion.matrix
## predicted
## true non_tolerant tolerant
## non_tolerant 17 4
## tolerant 3 10
paste0("prediction error: ", rf$prediction.error*100)
## [1] "prediction error: 20.5882352941176"
Selected features for the recipients (with a threshold of 0.9), after computing permutation distributions for every feature:
selected_ft_recip_qt <- select_features(perm_vals = perm_vals,
norm_data = norm_data,
threshold = 0.90,
file_path = recip_path,
file_name = "perm_val_TNT_90_thresh.xlsx")
print(paste0(length(selected_ft_recip_qt),
" features were selected with a 0.90 threshold in the St Louis cohort."))
## [1] "190 features were selected with a 0.90 threshold in the St Louis cohort."
We can then see which of these features are common to the Cryostem cohort:
## [1] "25 features were commonly found in both cohorts"
We can see how many of these features had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients)
## [1] "24 features out of the 25 common selected features have the same behaviour in both cohorts."
## [1] "X4_CD8.Naives.CXCR3...60....CD8.naives..28...CD8.TSCM..16...CD8.TCM..7.."
## [2] "X7_CD8.TSCM..67...CD8.TCM..33.."
## [3] "X23_DN..86.....CD8low..13...TSCM.like"
## [4] "X.CD24._2_CD8.TEM..47.....TEMRA..42.....TCM..4.....TSCM..3.."
## [5] "X.CD24._4_CD8.Naives.CXCR3...60....CD8.naives..28...CD8.TSCM..16...CD8.TCM..7.."
## [6] "X.CD24._30_CD4.TCM.Th2.Like...Treg..20.50.."
## [7] "X.CD25._3_CD8.TCM.Tc2.like..76.....DP.TCM..24.."
## [8] "X.CD38._6_CD8.T.cells.FoxP3..CXCR3..CCR7..FAS."
## [9] "X.CD38._7_CD8.TSCM..67...CD8.TCM..33.."
## [10] "X.CD38._10_T.population.CD8...DP.CD4..FoxP3...40.70..."
## [11] "X.CD38._16_T.population.Naive.TSCM"
## [12] "X.CD38._17_CD8...87...DP..10....TEMRA..70....TSCM..30....CXCR5..B.markers"
## [13] "X.CD38._19_DN..86...CD8low.13...TSCM.like.FoxP3...30.à.90.."
## [14] "X.CD38._20_NK.cells..66.....Conventional.DCs...some.moDCs..34.."
## [15] "X.CD38._23_DN..86.....CD8low..13...TSCM.like"
## [16] "X.CD38._25_B.naives..90.....B.memory..10.."
## [17] "X.CD38._27_CD4.Treg"
## [18] "X.CD38._33_B.naives..100.."
## [19] "X.CD38._35_Conventional.DCs...90.."
## [20] "X.CTLA4._2_CD8.TEM..47.....TEMRA..42.....TCM..4.....TSCM..3.."
## [21] "X.CTLA4._10_T.population.CD8...DP.CD4..FoxP3...40.70..."
## [22] "X.CTLA4._20_NK.cells..66.....Conventional.DCs...some.moDCs..34.."
## [23] "X.CTLA4._24_CD4.TCM.Th17.like"
## [24] "X.PD1._24_CD4.TCM.Th17.like"
Graph of the selected features that are common between the two cohorts: The features that were correlated (ie expressed in similar ways among the recipients) are linked in the graph. The features that were more expressed in the non tolerant St Louis recipients are colored in red, while the features that were more expressed in tolerant recipients are colored in blue.
## IGRAPH 39dba27 UN-- 25 18 --
## + attr: name (v/c), value (e/n)
## + edges from 39dba27 (vertex names):
## [1] X4_CD8.Naives.CXCR3...60....CD8.naives..28...CD8.TSCM..16...CD8.TCM..7.. --X23_DN..86.....CD8low..13...TSCM.like
## [2] X.CD24._2_CD8.TEM..47.....TEMRA..42.....TCM..4.....TSCM..3.. --X.CD24._4_CD8.Naives.CXCR3...60....CD8.naives..28...CD8.TSCM..16...CD8.TCM..7..
## [3] X.CD24._4_CD8.Naives.CXCR3...60....CD8.naives..28...CD8.TSCM..16...CD8.TCM..7..--X.CD24._30_CD4.TCM.Th2.Like...Treg..20.50..
## + ... omitted several edges
We can try to build a model to classify tolerant from non tolerant recipients of the St Louis cohort, using the selected features that were common between the two cohorts:
dta_tmp <- common_features$local_df
set.seed(1)
nb_features = length(common_features$common_features)
rf1 <- rf_tol_non_tol(df_orig = dta_tmp,
mtry = round(sqrt(nb_features)), ntree = 1000, package_2use = "ranger")
rf1$confusion.matrix
## predicted
## true non_tolerant tolerant
## non_tolerant 16 5
## tolerant 2 11
paste0("prediction error: ", rf1$prediction.error*100)
## [1] "prediction error: 20.5882352941176"
Now that we have selected features that seemed to play a role in tolerance that were common to both cohorts (with a threhold over 0.90), we decide to investigate into these features to see if some of them are related to the age of the recipients, to the gender compatibility between the donor and recipient, to both or to none of these two causes.
I first build new models using the features that were kept in the two cohorts as informative when comparing tolerant and non tolerant patients:
new_models <- lapply(seq_along(common_features$common_features), function(idx){
dta_tmp <- norm_data[,c(common_features$common_features[idx],
"group", "age_recip", "gender_comp")]
colnames(dta_tmp)[1] <- "Var"
fit1 <- glm(group ~ Var, data=dta_tmp,trace=F, family = "binomial")
fit2 <- glm(group ~ gender_comp + Var,
data=dta_tmp,trace=F, family = "binomial")
fit3 <- glm(group ~ age_recip + Var,
data=dta_tmp,trace=F, family = "binomial")
fit4 <- glm(group ~ gender_comp + age_recip + Var,
data=dta_tmp,trace=F, family = "binomial")
list(fit1, fit2, fit3, fit4)
})
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
plots_gender_age(pdf_name = "../plots/cyto/recip/age_and_gender/tol_nontol_0.9.pdf",
norm_data = norm_data,
features = common_features$common_features)
## Warning: Transformation introduced infinite values in continuous x-axis
## quartz_off_screen
## 2
plots_gender_age(pdf_name = "../plots/cyto/recip/age_and_gender/tol_nontol_0.9_behavior.pdf",
norm_data = norm_data,
features = common_beh_features)
## Warning: Transformation introduced infinite values in continuous x-axis
## quartz_off_screen
## 2
Run the permutation distribution, taking demographic variables into account: I first add the age and gender compatibility information to the data.
I then compute the permutation distribution for primary versus secondary tolerant recipients.
Compute the median per group for every gene, to use it later in the graphs:
perm_vals <- readRDS(paste0(recip_path, "perm_vals_PTST.RDS"))
norm_data <- readRDS(paste0(recip_path, "norm_data_PTST.RDS"))
gr_medians <- norm_data %>%
group_by(group) %>%
summarize_if(is.numeric, median) %>%
as.data.frame()
rownames(gr_medians) <- gr_medians$group
Selected features for the recipients (with a threshold of 0.95), after computing permutation distributions for every feature:
selected_ft_recip_qt <- select_features(perm_vals = perm_vals,
norm_data = norm_data,
threshold = 0.95,
file_path = recip_path,
file_name = "perm_val_PTST_95_thresh.xlsx")
print(paste0(length(selected_ft_recip_qt),
" features were selected with a 0.95 threshold in the St Louis cohort."))
## [1] "39 features were selected with a 0.95 threshold in the St Louis cohort."
Which of these features are common to the Cryostem cohort?
## [1] "1 features were commonly found in both cohorts"
We can see how many of these features had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients)
## [1] "1 features out of the 1 common selected features have the same behaviour in both cohorts."
## [1] "X.CD38._29_DP.Treg"
Selected features for the recipients (with a threshold of 0.90), after computing permutation distributions for every feature:
selected_ft_recip_qt <- select_features(perm_vals = perm_vals,
norm_data = norm_data,
threshold = 0.90,
file_path = recip_path,
file_name = "perm_val_PTST_90_thresh.xlsx")
print(paste0(length(selected_ft_recip_qt),
" features were selected with a 0.90 threshold in the St Louis cohort."))
## [1] "71 features were selected with a 0.90 threshold in the St Louis cohort."
Which of these features are common to the Cryostem cohort?
## [1] "4 features were commonly found in both cohorts"
We can see how many of these features had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients)
## [1] "2 features out of the 4 common selected features have the same behaviour in both cohorts."
## [1] "X.CD38._16_T.population.Naive.TSCM"
## [2] "X.CD38._29_DP.Treg"
We can try to build a model to classify the St Louis primary and secondary tolerant recipients using the features that were common to both cohorts:
## predicted
## true primary_tolerant secondary_tolerant
## primary_tolerant 6 1
## secondary_tolerant 2 4
## [1] "prediction error: 23.077"
It seems like we are able to classify the primary and secondary tolerant recipients more accurately that when we were trying to classify the couples.
Now that we have selected features that seemed to play a role in tolerance that were common to both cohorts (with a threhold over 0.90), we decide to investigate into these features to see if some of them are related to the age of the recipients, to the gender compatibility between the donor and recipient, to both or to none of these two causes.
I first build new models using the features that were kept in the two cohorts as informative when comparing tolerant and non tolerant patients:
new_models <- lapply(seq_along(common_features$common_features), function(idx){
dta_tmp <- norm_data[,c(common_features$common_features[idx],
"group", "age_recip", "gender_comp")]
colnames(dta_tmp)[1] <- "Var"
fit1 <- glm(group ~ Var, data=dta_tmp,trace=F, family = "binomial")
fit2 <- glm(group ~ gender_comp + Var,
data=dta_tmp,trace=F, family = "binomial")
fit3 <- glm(group ~ age_recip + Var,
data=dta_tmp,trace=F, family = "binomial")
fit4 <- glm(group ~ gender_comp + age_recip + Var,
data=dta_tmp,trace=F, family = "binomial")
list(fit1, fit2, fit3, fit4)
})
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
plots_gender_age(pdf_name = "../plots/cyto/recip/age_and_gender/prim_sec_0.9.pdf",
norm_data = norm_data,
features = common_features$common_features)
## Warning: Transformation introduced infinite values in continuous x-axis
## quartz_off_screen
## 2
plots_gender_age(pdf_name = "../plots/cyto/recip/age_and_gender/prim_sec_0.9_behavior.pdf",
norm_data = norm_data,
features = common_beh_features)
## Warning: Transformation introduced infinite values in continuous x-axis
## quartz_off_screen
## 2
Load the fcs files and select the files of interest:
fcs_dir <- "~/Documents/VIB/Projects/Integrative_Paris/documents_22:02:18/CYTOF_David_Michonneau/fcs/"
fcs_names <- list.files(fcs_dir, pattern="^2.*fcs$")
fcs_names[["R1044"]] <- "20170530_R1044_01_normalized_livecellswithoutbeads__livecells__ADN__time__Ungated_____ICOScleaned__Ungated_"
names(fcs_names) <- gsub("^[0-9]*_([^_]*)_.*", "\\1", fcs_names)
rd_names <- fcs_names[-which(names(fcs_names)%in%c("12R","18R","12D","D1071","D369"))]
## import metadat info :
dataPath <- "~/Documents/VIB/Projects/Integrative_Paris/documents_22:02:18/CYTOF_David_Michonneau/Data synthesis local cohort Saint-Louis 032018_modified.xlsx"
samp_rd <- import_patient_info(data_synthesis_file = dataPath,
patient_names = rd_names,
patient_type = "donor")
complete_couples <- names(table(samp_rd$COUPLENUMBER))[which(table(samp_rd$COUPLENUMBER)==2)]
complete_names <- rownames(samp_rd)[which(samp_rd$COUPLENUMBER %in% complete_couples)]
donor_names_short <- complete_names[grep("D", complete_names)]
donor_names <- rd_names[donor_names_short]
samp_donor <- samp_rd[names(donor_names),]
Isolate only the percentages of the donors:
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/cyto/3_backbones/backbone_2_D&Rall/pctgs_meta_rd_automatic_clustering.RData")
pctgs_donor <- pctgs_meta_rd[names(donor_names),]
colnames(pctgs_donor) <- labels
We can visualise the donors FlowSOM profiles (the percentage of cells of the patients that map to the different metaclusters) in a PCA or tSNE map, to see which donors are most similar:
PCA :
## Warning: Column `Id.Cryostem.R` joining character vector and factor,
## coercing into character vector
The first component of the PCA seems quite informative :
plot(pca_donor_louis)
All tolerance groups seem to be grouped together, with no real observable pattern
Extract the ratios of cells that are positive for the functional markers for the donors only:
Taking together information about the percentages in the FlowSOM metaclusters and the percentage of positive cells for the functional markers in these metaclusters:
## Warning: Column `Id.Cryostem.R` joining character vector and factor,
## coercing into character vector
We can try to build a model to classify the donors between the different tolerance groups: It seems that the CyTOF phenotypic and functional markers are not sufficient to classify the donors in primary, secondary and non tolerant: almost all donors get classified as non tolerant
pheno_funct <- cbind(samp_donor$GROUP, donor_pheno_funct)
colnames(pheno_funct)[1] <- "group"
colnames(pheno_funct) <- make.names(colnames(pheno_funct))
rf_pheno_funct <- randomForest::randomForest(group~., pheno_funct,
mtry = 50, ntree = 10000)
rf_pheno_funct
##
## Call:
## randomForest(formula = group ~ ., data = pheno_funct, mtry = 50, ntree = 10000)
## Type of random forest: classification
## Number of trees: 10000
## No. of variables tried at each split: 50
##
## OOB estimate of error rate: 41.18%
## Confusion matrix:
## non_tolerant primary_tolerant secondary_tolerant
## non_tolerant 20 1 0
## primary_tolerant 7 0 0
## secondary_tolerant 6 0 0
## class.error
## non_tolerant 0.04761905
## primary_tolerant 1.00000000
## secondary_tolerant 1.00000000
Trying to classify between non tolerant and tolerant donors doesn’t show better results:
set.seed(1)
rf2_pf <- rf_tol_non_tol(df_orig = pheno_funct, mtry = 50, ntree = 10000)
rf2_pf
##
## Call:
## randomForest(formula = group ~ ., data = df_tmp, mtry = mtry, ntree = ntree, importance = T, proximity = T)
## Type of random forest: classification
## Number of trees: 10000
## No. of variables tried at each split: 50
##
## OOB estimate of error rate: 55.88%
## Confusion matrix:
## non_tolerant tolerant class.error
## non_tolerant 14 7 0.3333333
## tolerant 12 1 0.9230769
Trying to classify between primary and secondary tolerant donors neither:
set.seed(1)
rftol_f <- rf_tol1_tol2(df_orig = pheno_funct, mtry = 50, ntree = 10000)
rftol_f
##
## Call:
## randomForest(formula = group ~ ., data = df_tol, mtry = mtry, ntree = ntree, importance = T, proximity = T)
## Type of random forest: classification
## Number of trees: 10000
## No. of variables tried at each split: 50
##
## OOB estimate of error rate: 84.62%
## Confusion matrix:
## primary_tolerant secondary_tolerant class.error
## primary_tolerant 1 6 0.8571429
## secondary_tolerant 5 1 0.8333333
Run the permutation distribution, taking demographic variables into account: I first add the age and gender compatibility information to the data.
I then compute the permutation distribution for non vs tolerant donors.
Compute the median per group for every gene, to use it later in the graphs:
perm_vals <- readRDS(paste0(donor_path, "perm_vals_TNT.RDS"))
norm_data <- readRDS(paste0(donor_path, "norm_data_TNT.RDS"))
gr_medians <- norm_data %>%
group_by(group) %>%
summarize_if(is.numeric, median) %>%
as.data.frame()
rownames(gr_medians) <- gr_medians$group
Selected features for the donors (with a threshold of 0.95), after computing permutation distributions for every feature:
selected_ft_donors_qt <- select_features(perm_vals = perm_vals,
norm_data = norm_data,
threshold = 0.95,
file_path = donor_path,
file_name = "perm_val_TNT_95_thresh.xlsx")
print(paste0(length(selected_ft_donors_qt),
" features were selected with a 0.95 threshold in the Cryostem cohort."))
## [1] "18 features were selected with a 0.95 threshold in the Cryostem cohort."
We can then see which of these features are common to the St louis cohort:
## [1] "0 features were commonly found in both cohorts"
Selected features for the donors (with a threshold of 0.90), after computing permutation distributions for every feature:
selected_ft_donors_qt <- select_features(perm_vals = perm_vals,
norm_data = norm_data,
threshold = 0.90,
file_path = donor_path,
file_name = "perm_val_TNT_90_thresh.xlsx")
print(paste0(length(selected_ft_donors_qt),
" features were selected with a 0.90 threshold in the St Louis cohort."))
## [1] "44 features were selected with a 0.90 threshold in the St Louis cohort."
We can see which features were commonly found between the two cohorts to classify tolerant and non tolerant donors:
## [1] "4 features were commonly found in both cohorts"
We can also see how many of these features had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients)
## [1] "2 features out of the 4 common selected features have the same behaviour in both cohorts."
## [1] "X10_T.population.CD8...DP.CD4..FoxP3...40.70..."
## [2] "X.CD38._32_CD4.Treg...B.markers"
We can try to build a model to classify tolerant from non tolerant donors of the St Louis cohort, using the selected features that were common between the two cohorts: But it seems like the donors CyTOF data isn’t sufficient to classify tolerant from non tolerant donors in the St Louis cohort.
set.seed(1)
nb_features = length(common_features$common_features)
rf1 <- rf_tol_non_tol(df_orig = common_features$local_df,
mtry = round(sqrt(nb_features)), ntree = 1000, package_2use = "ranger")
rf1$confusion.matrix
## predicted
## true non_tolerant tolerant
## non_tolerant 17 4
## tolerant 8 5
paste0("prediction error: ", rf1$prediction.error*100)
## [1] "prediction error: 35.2941176470588"
Now that we have selected features that seemed to play a role in tolerance that were common to both cohorts (with a threhold over 0.90), we decide to investigate into these features to see if some of them are related to the age of the recipients, to the gender compatibility between the donor and recipient, to both or to none of these two causes.
I first build new models using the features that were kept in the two cohorts as informative when comparing tolerant and non tolerant patients:
new_models <- lapply(seq_along(common_features$common_features), function(idx){
dta_tmp <- norm_data[,c(common_features$common_features[idx],
"group", "age_recip", "gender_comp")]
colnames(dta_tmp)[1] <- "Var"
fit1 <- glm(group ~ Var, data=dta_tmp,trace=F, family = "binomial")
fit2 <- glm(group ~ gender_comp + Var,
data=dta_tmp,trace=F, family = "binomial")
fit3 <- glm(group ~ age_recip + Var,
data=dta_tmp,trace=F, family = "binomial")
fit4 <- glm(group ~ gender_comp + age_recip + Var,
data=dta_tmp,trace=F, family = "binomial")
list(fit1, fit2, fit3, fit4)
})
plots_gender_age(pdf_name = "../plots/cyto/donors/age_and_gender/tol_nontol_0.9.pdf",
norm_data = norm_data,
features = common_features$common_features)
## quartz_off_screen
## 2
plots_gender_age(pdf_name = "../plots/cyto/donors/age_and_gender/tol_nontol_0.9_behavior.pdf",
norm_data = norm_data,
features = common_beh_features)
## quartz_off_screen
## 2
Run the permutation distribution, taking demographic variables into account: I first add the age and gender compatibility information to the data.
I then compute the permutation distribution for non vs tolerant recipients.
Compute the median per group for every feature, to use it later in the graphs:
perm_vals <- readRDS(paste0(donor_path, "perm_vals_PTST.RDS"))
norm_data <- readRDS(paste0(donor_path, "norm_data_PTST.RDS"))
gr_medians <- norm_data %>%
group_by(group) %>%
summarize_if(is.numeric, median) %>%
as.data.frame()
rownames(gr_medians) <- gr_medians$group
Selected features for the donors (with a threshold of 0.95), after computing permutation distributions for every feature:
selected_ft_donors_qt <- select_features(perm_vals = perm_vals,
norm_data = norm_data,
threshold = 0.95,
file_path = donor_path,
file_name = "perm_val_PTST_95_thresh.xlsx")
print(paste0(length(selected_ft_donors_qt),
" features were selected with a 0.95 threshold in the St Louis cohort."))
## [1] "17 features were selected with a 0.95 threshold in the St Louis cohort."
We can then see which of these features are common to the Cryostem cohort:
## [1] "1 features were commonly found in both cohorts"
Selected features for the recipients (with a threshold of 0.90), after computing permutation distributions for every feature:
selected_ft_donors_qt <- select_features(perm_vals = perm_vals,
norm_data = norm_data,
threshold = 0.90,
file_path = donor_path,
file_name = "perm_val_PTST_90_thresh.xlsx")
print(paste0(length(selected_ft_donors_qt),
" features were selected with a 0.90 threshold in the St Louis cohort."))
## [1] "51 features were selected with a 0.90 threshold in the St Louis cohort."
We can then see which of these features are common to the Cryostem cohort:
## [1] "4 features were commonly found in both cohorts"
We can see how many of these features had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients)
## [1] "3 features out of the 4 common selected features have the same behaviour in both cohorts."
## [1] "X.41BB._21_B.naives."
## [2] "X.CD38._12_DN.T.cells..70...CD8..30...TEM.TCM"
## [3] "X.CD38._19_DN..86...CD8low.13...TSCM.like.FoxP3...30.à.90.."
Now that we have selected features that should help us classify non and tolerant recipients, we can see that a RF model built on these selected features gives similar results to the one built on all features:
## predicted
## true primary_tolerant secondary_tolerant
## primary_tolerant 5 2
## secondary_tolerant 3 3
## [1] "prediction error: 38.4615384615385"
Now that we have selected features that seemed to play a role in tolerance that were common to both cohorts (with a threhold over 0.90), we decide to investigate into these features to see if some of them are related to the age of the recipients, to the gender compatibility between the donor and recipient, to both or to none of these two causes.
I first build new models using the features that were kept in the two cohorts as informative when comparing tolerant and non tolerant patients:
new_models <- lapply(seq_along(common_features$common_features), function(idx){
dta_tmp <- norm_data[,c(common_features$common_features[idx],
"group", "age_recip", "gender_comp")]
colnames(dta_tmp)[1] <- "Var"
fit1 <- glm(group ~ Var, data=dta_tmp,trace=F, family = "binomial")
fit2 <- glm(group ~ gender_comp + Var,
data=dta_tmp,trace=F, family = "binomial")
fit3 <- glm(group ~ age_recip + Var,
data=dta_tmp,trace=F, family = "binomial")
fit4 <- glm(group ~ gender_comp + age_recip + Var,
data=dta_tmp,trace=F, family = "binomial")
list(fit1, fit2, fit3, fit4)
})
plots_gender_age(pdf_name = "../plots/cyto/donors/age_and_gender/prim_sec_0.9.pdf",
norm_data = norm_data,
features = common_features$common_features)
## quartz_off_screen
## 2
plots_gender_age(pdf_name = "../plots/cyto/donors/age_and_gender/prim_sec_0.9_behavior.pdf",
norm_data = norm_data,
features = common_beh_features)
## quartz_off_screen
## 2
Now that we have selected features at the couple level, at the recipient level and at the donor level, we can try to combine these features:
We first combine the features that were selected as highly informative to classify between tolerant and non tolerant patients from the couples, recipients and donors into one big table:
# couple info:
load("../data/cyto/rd/pctgs_sel_ft_couples_TNT_90.RData")
# add "couple_" in front of the feature names
colnames2keep <- which(colnames(pctgs_sel_ft_couples_90) %in% c("group", "COUPLENUMBER"))
colnames(pctgs_sel_ft_couples_90)[-colnames2keep] <-
gsub("X", "couple_", colnames(pctgs_sel_ft_couples_90)[-colnames2keep])
# combine feature info with couple info about gender_comp...
pctgs_couples <- couple_info %>%
rownames_to_column("Id.Cryostem.R") %>%
select("Id.Cryostem.R", "COUPLENUMBER", "gender_comp", "age_recip") %>%
dplyr::filter(grepl("R", rownames(couple_info))) %>%
inner_join(pctgs_sel_ft_couples_90, by = "COUPLENUMBER")
# recip info:
load("../data/cyto/recip/pctgs_sel_ft_recip_TNT_90.RData")
colnames2keep <- which(colnames(pctgs_sel_ft_recip_90) %in% c("group", "Id.Cryostem.R"))
colnames(pctgs_sel_ft_recip_90)[-colnames2keep] <-
gsub("X", "R_", colnames(pctgs_sel_ft_recip_90)[-colnames2keep])
pctgs_recip <- couple_info %>%
rownames_to_column("Id.Cryostem.R") %>%
select("Id.Cryostem.R", "COUPLENUMBER") %>%
right_join(pctgs_sel_ft_recip_90, by = "Id.Cryostem.R")
# donor info :
load("../data/cyto/donors/pctgs_sel_ft_donor_TNT_90.RData")
colnames2keep <- which(colnames(pctgs_sel_ft_donor_90) %in% c("group", "Id.Cryostem.R"))
colnames(pctgs_sel_ft_donor_90)[-colnames2keep] <-
gsub("X", "D_", colnames(pctgs_sel_ft_donor_90)[-colnames2keep])
pctgs_donor <- couple_info %>%
rownames_to_column("Id.Cryostem.R") %>%
select("Id.Cryostem.R", "COUPLENUMBER") %>%
right_join(pctgs_sel_ft_donor_90, by = "Id.Cryostem.R")
# combine :
pctgs_all <- pctgs_couples %>%
left_join(pctgs_recip) %>%
left_join(pctgs_donor, by = "COUPLENUMBER")
## Joining, by = c("Id.Cryostem.R", "COUPLENUMBER", "group")
We can generate a PCA and a tSNE using all these features to see how the tolerant and non tolerant patients are represented in the reduced dimensions:
pctgs_all_only <- pctgs_all %>%
select_if(is.numeric) %>%
column_to_rownames("COUPLENUMBER") %>%
select(-"age_recip")
pca_all <- prcomp(pctgs_all_only)
The 2 first components of the PCA seem to hold quite a lot of the variability in the data:
plot(pca_all)
pca_2plot <- as.data.frame(pca_all$x) %>%
mutate(COUPLENUMBER = as.numeric(rownames(pca_all$x))) %>%
left_join(pctgs_all, by = "COUPLENUMBER")
ggplot(pca_2plot, aes(x = PC1, y = PC2, col = as.factor(group.x), label = Id.Cryostem.R.x)) +
geom_point() +
geom_text(aes(label=Id.Cryostem.R.x),hjust=0, vjust=0)
The non tolerant patients seem to have higher PC1 values on the PCA (they are situated more on the right), which suggests that we might be able to classify tolerant from non tolerant patients based on the features we selected.
At least, the selected variables do not seem affected by the batch effect:
samp_recip_tmp <- samp_recip %>% mutate(Id.Cryostem.R = as.character(Id.Cryostem.R))
pca_2plot <- as.data.frame(pca_all$x) %>%
mutate(COUPLENUMBER = as.numeric(rownames(pca_all$x))) %>%
left_join(pctgs_all, by = "COUPLENUMBER") %>%
mutate(Id.Cryostem.R = Id.Cryostem.R.x)
bla <- pca_2plot %>%
left_join(samp_recip_tmp , by = "Id.Cryostem.R")
ggplot(bla, aes(x = PC1, y = PC2, col = as.factor(DATEOFCYTOFEXPERIMENT), label = Id.Cryostem.R)) +
geom_point() +
geom_text(aes(label=Id.Cryostem.R.x),hjust=0, vjust=0)
We can also generate a tSNE on the same data:
tsne_all <- Rtsne::Rtsne(pctgs_all_only, perplexity = 10)
tsne_2plot <- as.data.frame(tsne_all$Y) %>%
magrittr::set_colnames(c("X1", "X2")) %>%
mutate(COUPLENUMBER = as.numeric(rownames(pctgs_all_only))) %>%
left_join(pctgs_all, by = "COUPLENUMBER")
ggplot(tsne_2plot, aes(x = X1, y = X2, col = as.factor(group.x),
label = Id.Cryostem.R.x)) +
geom_point() +
geom_text(aes(label=Id.Cryostem.R.x),hjust=0, vjust=0)
We can see if building a model using all these features combined leads to good classification results between tolerant and non tolerant patients:
pctgs_all_RF <- pctgs_all %>%
select(-"COUPLENUMBER", -"gender_comp", -"age_recip") %>%
mutate(group = as.factor(group.x)) %>%
select(-"group.x", -"group.y", -"Id.Cryostem.R.y") %>%
column_to_rownames("Id.Cryostem.R.x")
set.seed(1)
nb_features = ncol(pctgs_all_RF)-1
rf <- rf_tol_non_tol(df_orig = pctgs_all_RF,
mtry = round(sqrt(nb_features)), ntree = 1000, package_2use = "ranger")
rf$confusion.matrix
## predicted
## true non_tolerant tolerant
## non_tolerant 16 5
## tolerant 5 8
paste0("prediction error: ", rf$prediction.error*100)
## [1] "prediction error: 29.4117647058824"
The classification error is quite high, but we can still see which features were selected in the model to classify tolerant from non tolerant patients in the St Louis cohort:
## R_23_DN..86.....CD8low..13...TSCM.like
## 1.4836536
## R_.CD38._27_CD4.Treg
## 0.9802577
## couple_.CD38..6_CD8.T.cells.FoxP3..Ccouple_CR3..CCR7..FAS.
## 0.8963434
## couple_.CD38..19_DN..86...CD8low.13...TSCM.like.FoxP3...30.à.90..
## 0.8501487
## R_.CD38._19_DN..86...CD8low.13...TSCM.like.FoxP3...30.à.90..
## 0.6412911
## R_4_CD8.Naives.CR_CR3...60....CD8.naives..28...CD8.TSCM..16...CD8.TCM..7..
## 0.6108603
## R_.CD38._20_NK.cells..66.....Conventional.DCs...some.moDCs..34..
## 0.6061123
## couple_.CD38..32_CD4.Treg...B.markers
## 0.5433161
## R_.CD38._35_Conventional.DCs...90..
## 0.5277534
## couple_.CD38..13_CD4.TCM...TEM.Th2.like.FoxP3...6.30..
## 0.5218810
## R_.CD38._6_CD8.T.cells.FoxP3..CR_CR3..CCR7..FAS.
## 0.5064835
## R_.CD38._10_T.population.CD8...DP.CD4..FoxP3...40.70...
## 0.4193420
## couple_.CD38..17_CD8...87...DP..10....TEMRA..70....TSCM..30....Ccouple_CR5..B.markers
## 0.3203844
## R_.CD38._16_T.population.Naive.TSCM
## 0.2887959
## R_.CD38._33_B.naives..100..
## 0.2882176
## R_.CD24._30_CD4.TCM.Th2.Like...Treg..20.50..
## 0.2830151
## couple_.CD38..23_DN..86.....CD8low..13...TSCM.like
## 0.2649505
## R_.CD38._23_DN..86.....CD8low..13...TSCM.like
## 0.2568765
## couple_.CD38..25_B.naives..90.....B.memory..10..
## 0.2384732
## couple_.CD38..16_T.population.Naive.TSCM
## 0.2314222
The variables that were mainly selected as informative in the model are the ones from recipients and couples.
We can visualise the top variables in a heatmap:
mat2plot <- pctgs_all_RF[,c("group", names(sort(rf$variable.importance, decreasing = TRUE))[1:20])] %>%
rownames_to_column("rownames") %>%
arrange(group) %>%
column_to_rownames("rownames")
annot_patients <- as.data.frame(mat2plot$group)
rownames(annot_patients) <- rownames(mat2plot)
colnames(annot_patients) <- "group"
ann_colors <- list(group = c(non_tolerant = "indianred1",
tolerant = "royalblue1"))
pheatmap::pheatmap(mat2plot[,-which(colnames(mat2plot)=="group")],
annotation_row = annot_patients,
annotation_colors = ann_colors[1],
cluster_rows = FALSE,
cluster_cols = FALSE,
main = "Heatmap with patients ordered by group",
scale = "column")
pheatmap::pheatmap(mat2plot[,-which(colnames(mat2plot)=="group")],
annotation_row = annot_patients,
annotation_colors = ann_colors[1],
cluster_rows = TRUE,
main = "Heatmap with patients clustered",
scale = "column")
Reminder: high values for the variables that start with “couple_” mean that the difference of expression between the donor and the recipient of the same couple was high for this specific feature.
We first combine the features that were selected as highly informative to classify between tolerant and non tolerant patients from the couples, recipients and donors into one big table:
# couple info:
load("../data/cyto/rd/pctgs_sel_ft_couples_PTST_90.RData")
# add "couple_" in front of the feature names
colnames2keep <- which(colnames(pctgs_sel_ft_couples_90) %in% c("group", "COUPLENUMBER"))
colnames(pctgs_sel_ft_couples_90)[-colnames2keep] <-
gsub("X", "couple_", colnames(pctgs_sel_ft_couples_90)[-colnames2keep])
# combine feature info with couple info about gender_comp...
pctgs_couples <- couple_info %>%
rownames_to_column("Id.Cryostem.R") %>%
select("Id.Cryostem.R", "COUPLENUMBER", "gender_comp", "age_recip") %>%
dplyr::filter(grepl("R", rownames(couple_info))) %>%
inner_join(pctgs_sel_ft_couples_90 %>% mutate(COUPLENUMBER = as.numeric(COUPLENUMBER)), by = "COUPLENUMBER")
# recip info:
load("../data/cyto/recip/pctgs_sel_ft_recip_PTST_90.RData")
colnames2keep <- which(colnames(pctgs_sel_ft_recip_90) %in% c("group", "Id.Cryostem.R"))
colnames(pctgs_sel_ft_recip_90)[-colnames2keep] <-
gsub("X", "R_", colnames(pctgs_sel_ft_recip_90)[-colnames2keep])
pctgs_recip <- couple_info %>%
rownames_to_column("Id.Cryostem.R") %>%
select("Id.Cryostem.R", "COUPLENUMBER") %>%
right_join(pctgs_sel_ft_recip_90, by = "Id.Cryostem.R")
# donor info :
load("../data/cyto/donors/pctgs_sel_ft_donor_PTST_90.RData")
colnames2keep <- which(colnames(pctgs_sel_ft_donor_90) %in% c("group", "Id.Cryostem.R"))
colnames(pctgs_sel_ft_donor_90)[-colnames2keep] <-
gsub("X", "D_", colnames(pctgs_sel_ft_donor_90)[-colnames2keep])
pctgs_donor <- couple_info %>%
rownames_to_column("Id.Cryostem.R") %>%
select("Id.Cryostem.R", "COUPLENUMBER") %>%
right_join(pctgs_sel_ft_donor_90, by = "Id.Cryostem.R")
# combine :
pctgs_all <- pctgs_couples %>%
left_join(pctgs_recip) %>%
left_join(pctgs_donor, by = "COUPLENUMBER")
## Joining, by = c("Id.Cryostem.R", "COUPLENUMBER", "group")
We can generate a PCA and a tSNE using all these features to see how the tolerant and non tolerant patients are represented in the reduced dimensions:
pctgs_all_only <- pctgs_all %>%
select_if(is.numeric) %>%
column_to_rownames("COUPLENUMBER") %>%
select(-"age_recip")
pca_all <- prcomp(pctgs_all_only)
The 2 first components of the PCA seem to hold quite a lot of the variability in the data:
plot(pca_all)
pca_2plot <- as.data.frame(pca_all$x) %>%
mutate(COUPLENUMBER = as.numeric(rownames(pca_all$x))) %>%
left_join(pctgs_all, by = "COUPLENUMBER")
ggplot(pca_2plot, aes(x = PC1, y = PC2, col = as.factor(group.x), label = Id.Cryostem.R.x)) +
geom_point() +
scale_color_manual(breaks = c("primary_tolerant", "secondary_tolerant"),
values = c("springgreen3", "royalblue1")) +
geom_text(aes(label=Id.Cryostem.R.x),hjust=0, vjust=0)
The primary tolerant patients seem to have lower PC2 and PC1 values on the PCA (they are situated more on the bottom-left), which suggests that we might be able to classify primary from secondary tolerant patients based on the features we selected.
At least, the selected variables do not seem affected by the batch effect:
samp_recip_tmp <- samp_recip %>% mutate(Id.Cryostem.R = as.character(Id.Cryostem.R))
pca_2plot <- as.data.frame(pca_all$x) %>%
mutate(COUPLENUMBER = as.numeric(rownames(pca_all$x))) %>%
left_join(pctgs_all, by = "COUPLENUMBER") %>%
mutate(Id.Cryostem.R = Id.Cryostem.R.x)
bla <- pca_2plot %>%
left_join(samp_recip_tmp , by = "Id.Cryostem.R")
ggplot(bla, aes(x = PC1, y = PC2, col = as.factor(DATEOFCYTOFEXPERIMENT), label = Id.Cryostem.R)) +
geom_point() +
geom_text(aes(label=Id.Cryostem.R.x),hjust=0, vjust=0)
We can also generate a tSNE on the same data:
tsne_all <- Rtsne::Rtsne(pctgs_all_only, perplexity = 3)
tsne_2plot <- as.data.frame(tsne_all$Y) %>%
magrittr::set_colnames(c("X1", "X2")) %>%
mutate(COUPLENUMBER = as.numeric(rownames(pctgs_all_only))) %>%
left_join(pctgs_all, by = "COUPLENUMBER")
ggplot(tsne_2plot, aes(x = X1, y = X2, col = as.factor(group.x),
label = Id.Cryostem.R.x)) +
geom_point() +
scale_color_manual(breaks = c("primary_tolerant", "secondary_tolerant"),
values = c("springgreen3", "royalblue1")) +
geom_text(aes(label=Id.Cryostem.R.x),hjust=0, vjust=0)
We can see if building a model using all these features combined leads to good classification results between tolerant and non tolerant patients:
pctgs_all_RF <- pctgs_all %>%
select(-"COUPLENUMBER", -"gender_comp", -"age_recip") %>%
mutate(group = as.factor(group.x)) %>%
select(-"group.x", -"group.y", -"Id.Cryostem.R.y") %>%
column_to_rownames("Id.Cryostem.R.x")
set.seed(1)
rf <- ranger::ranger(group~., data = pctgs_all_RF, num.trees = 5000, importance = "impurity")
rf$confusion.matrix
## predicted
## true primary_tolerant secondary_tolerant
## primary_tolerant 6 1
## secondary_tolerant 1 5
paste0("prediction error: ", rf$prediction.error*100)
## [1] "prediction error: 15.3846153846154"
We can see which features were selected in the model to classify tolerant from non tolerant patients in the St Louis cohort:
## couple_.Ocouple_40..10_T.population.CD8...DP.CD4..FoxP3...40.70...
## 0.62862903
## R_.CD25._28_B.naives..91.....B.memory..8.....B.transitional..2..
## 0.52491620
## R_.IL10._10_T.population.CD8...DP.CD4..FoxP3...40.70...
## 0.48369827
## couple_.IL10..10_T.population.CD8...DP.CD4..FoxP3...40.70...
## 0.48159080
## couple_.IL10..31_CD4.Treg
## 0.46857005
## R_.CD38._29_DP.Treg
## 0.36037966
## R_.CD38._35_Conventional.DCs...90..
## 0.29518529
## D_.CD38._19_DN..86...CD8low.13...TSCM.like.FoxP3...30.à.90..
## 0.28707649
## couple_.IL10..35_Conventional.DCs...90..
## 0.25938634
## D_.CD38._12_DN.T.cells..70...CD8..30...TEM.TCM
## 0.25207939
## couple_.CD24..35_Conventional.DCs...90..
## 0.24985435
## R_.CD38._16_T.population.Naive.TSCM
## 0.24005802
## couple_.Ocouple_40..5_NK.cells..98.....T.cells.DN.TEMRA..2..
## 0.22270605
## couple_.IL10..8_DP.Treg.TSCM.like.
## 0.21788165
## D_.41BB._21_B.naives.
## 0.16527505
## couple_.Ocouple_40..36_CD4.TSCM
## 0.14530924
## D_.IL10._33_B.naives..100..
## 0.14439058
## couple_.IL10..16_T.population.Naive.TSCM
## 0.13586836
## couple_.HLADR..39_B.naive..50...B.memory..50...or.unspecified.B.cells
## 0.13363377
## R_.HLADR._15_T.cells.Th17.TSCM.reg.like
## 0.13222223
## D_.IL10._40_CD4.TSCM.like...B.markers
## 0.12346449
## couple_.ICOS..9_MoDCs..99.....CD4.TEM..1..
## 0.02960931
The variables that were mainly selected as informative in the model are the ones from recipients and couples.
We can visualise the top variables in a heatmap:
mat2plot <- pctgs_all_RF[,c("group", names(sort(rf$variable.importance, decreasing = TRUE))[1:20])] %>%
rownames_to_column("rownames") %>%
arrange(group) %>%
column_to_rownames("rownames")
annot_patients <- as.data.frame(mat2plot$group)
rownames(annot_patients) <- rownames(mat2plot)
colnames(annot_patients) <- "group"
ann_colors <- list(group = c(primary_tolerant = "springgreen3",
secondary_tolerant = "royalblue1"))
pheatmap::pheatmap(mat2plot[,-which(colnames(mat2plot)=="group")],
annotation_row = annot_patients,
annotation_colors = ann_colors[1],
cluster_rows = FALSE,
cluster_cols = FALSE,
main = "Heatmap with patients ordered by group",
scale = "column")
pheatmap::pheatmap(mat2plot[,-which(colnames(mat2plot)=="group")],
annotation_row = annot_patients,
annotation_colors = ann_colors[1],
cluster_rows = TRUE,
main = "Heatmap with patients clustered",
scale = "column")
Reminder: high values for the variables that start with “couple_” mean that the difference of expression between the donor and the recipient of the same couple was high for this specific feature.